#cuda is the programming language
knitpurlgoal · 5 months ago
Text
ty for the tag @18minutemajor!
last song: Hey Jude but specifically the Wilson Pickett version bc it fucks severely. the beatles WHO?
last movie: the big sleep w/ humphrey bogart bc my dad was claiming it was a christmas movie (ehhhhhh not really)
last tv show: i watched most of the first season of bad sisters which is a wild fucking ride. very "good for her"
sweet/savory/spicy: i love sugar! i also just had a week where i wanted the saltiest shit i could find, which doesnt mean anything probably
relationship status: single
last thing i googled: "the big sleep" bc i couldnt remember the name of the movie, but before THAT it was "fcc decision WEA languages"
current obsession: knitting gnomes and also getting stuff OUT of my house. i donated like 20 books the other day bc i finally admitted to myself that ill never read them
looking forward to: tomorrow im going to a cuda game at 2 and then luca's interview after the game, and then to the sharks game at 7 which definitely wont blow up in my face :) also the hockeyblr coffee date tomorrow! also also i started looking at grad programs!! also also also i have an xray coming up that should say im fully recovered from pneumonia so thats very fun and sexy
tagging @wheelsnipecelebrini @gusbuses @rubberpuckies @endeus @eros-menoitiades and anyone else who wants to join
govindhtech · 1 month ago
Text
LM Studio Accelerates LLMs with CUDA 12.8 and GeForce RTX GPUs
The latest desktop application update improves model controls, dev tools, and RTX GPU performance.
As AI use cases proliferate, developers and hobbyists want faster and more flexible ways to run large language models (LLMs), from document summarisation to custom software agents.
Running models locally on PCs with NVIDIA GeForce RTX GPUs enables high-performance inference while keeping data private and giving users full control over how AI is deployed and integrated. Free applications like LM Studio let users explore and work with LLMs on their own hardware.
LM Studio is a popular application for local LLM inference. Built on the fast llama.cpp runtime, it lets models run fully offline and also exposes them as OpenAI-compatible API endpoints for use in custom workflows.
LM Studio 0.3.15 uses CUDA 12.8 to improve model load and response times on RTX GPUs. The update also adds developer-focused features, including a redesigned system prompt editor and finer control over tool use via the "tool_choice" parameter.
The latest LM Studio improvements boost both usability and performance, delivering the highest throughput yet on RTX AI PCs. The result is faster responses, snappier interactions, and better tools for building and integrating AI locally.
AI Acceleration Meets Common Apps
LM Studio's versatility makes it suitable for everything from light experimentation to heavy integration into custom workflows. Models can be used through the desktop chat interface or, in developer mode, through OpenAI-compatible API calls, which makes it easy to wire local LLMs into custom desktop agents or workflows in tools such as Visual Studio Code.
LM Studio can also be integrated with Obsidian, the popular markdown-based knowledge management tool. Using community-developed plug-ins such as Text Generator and Smart Connections, users can query their notes, generate content, and summarise research with local LLMs. Because these plug-ins connect to LM Studio's local server, the interactions stay fast and private, with no cloud required.
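For example, here is a minimal sketch of calling LM Studio's local server from Python through the openai client library. It assumes the server is running at LM Studio's usual local address (http://localhost:1234/v1) and that a model is already loaded; the model name below is just a placeholder.

import openai
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key can be any placeholder string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves whichever model is currently loaded
    messages=[{"role": "user", "content": "Summarise my notes on CUDA graphs in three bullet points."}],
)
print(response.choices[0].message.content)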
Developer enhancements in 0.3.15 include an updated system prompt editor for longer or more sophisticated prompts and more accurate tool usage management through the “tool_choice” option.
The tool_choice parameter lets developers require a tool call, disable tool calls entirely, or let the model decide how to interact with external tools. This flexibility is valuable for structured interactions, retrieval-augmented generation (RAG) workflows, and agent pipelines. Together, these upgrades strengthen LLM use cases for developers in both experimentation and production.
LM Studio supports open models such as Gemma, Llama 3, Mistral, and Orca, in quantisation formats from 4-bit up to full precision.
Common use cases include RAG, document-based Q&A, multi-turn chat with long context windows, and local agent pipelines. A local inference server powered by the NVIDIA RTX-accelerated llama.cpp library lets users of RTX AI PCs integrate local LLMs with minimal effort.
LM Studio gives you full control, speed, and privacy on RTX, whether you're optimising a modest PC for efficiency or a big desktop for throughput.
Maximise RTX GPU Throughput
LM Studio's acceleration relies on the open-source runtime llama.cpp for consumer hardware inference. NVIDIA worked with LM Studio and llama.cpp to increase RTX GPU performance.
Important optimisations include:
CUDA graph enablement: groups multiple GPU operations into a single CPU call, reducing CPU overhead and boosting model throughput by up to 35%.
Flash attention CUDA kernels: improve how LLMs handle attention in transformer models, boosting throughput by up to 15% and enabling longer context windows without additional memory or compute.
Support for the newest RTX architectures: LM Studio's CUDA 12.8 update covers all RTX AI PCs, from GeForce 20 Series up to NVIDIA Blackwell-class GPUs, so users can scale their local AI workflows from laptops to high-end desktops.
LM Studio automatically changes to CUDA 12.8 with a compatible driver, improving model load times and performance.
These improvements speed up response times and smooth inference on all RTX AI PCs, from small laptops to large desktops and workstations.
Utilise LM Studio
LM Studio is free to download for Linux, macOS, and Windows. The recent 0.3.15 release and ongoing optimisations improve local AI performance, customisation, and usability, making the application faster, more versatile, and easier to use.
Users can import models and chat with them through the desktop interface, while developer mode exposes an OpenAI-compatible API.
Start immediately by downloading and launching the latest LM Studio.
Click the left magnifying glass to open Discover.
After selecting Runtime choices on the left side, find the CUDA 12 llama.cpp (Windows) runtime in the list of available runtimes and click "Download and Install".
After installation, choose CUDA 12 llama.cpp (Windows) from the Default Selections dropdown so that LM Studio uses this runtime.
To optimise CUDA execution in LM Studio, load a model and click the gear icon to the left of it to open Settings.
Drag the “GPU Offload” slider to the right to offload all model layers to the GPU, then enable “Flash Attention” from the selection menu.
Local NVIDIA GPU inference is possible if these functions are enabled and configured.
LM Studio supports model presets, a range of quantisation formats, and developer options such as tool_choice for fine-grained control over inference. For anyone who wants to contribute, the llama.cpp GitHub project is actively developed and continues to evolve with performance enhancements from the community and NVIDIA.
LM Studio 0.3.15 adds RTX 50-series GPU support and API tool-use improvements
LM Studio 0.3.15 is now available as a stable release. This release adds support for NVIDIA RTX 50-series GPUs (CUDA 12), UI changes including a revamped system prompt editor, the option to log each generated fragment to the API server logs, and improved tool use support in the API (the tool_choice parameter).
RTX 50-series GPU CUDA 12 compatibility
With llama.cpp engines, LM Studio supports CUDA 12.8 for RTX 50-series GPUs on Windows and Linux. As expected, this improvement speeds up first-time model load times on RTX 50-series GPUs. LM Studio will upgrade RTX 50-series systems to CUDA 12 when a compatible NVIDIA driver is present.
The minimum driver versions are:
Windows: 551.61 or newer
Linux: 550.54.14 or newer
If your driver meets the requirement, LM Studio automatically switches to CUDA 12 for RTX 50-series GPUs; with an incompatible driver, it falls back to CUDA 11. Runtimes can also be managed manually via Command+Shift+R.
New System Prompt Editor UI
System prompts are a powerful way to shape model behaviour, and they can range from a few words to several pages. LM Studio 0.3.15 adds a larger visual space for editing long prompts; the smaller prompt editor in the sidebar still works as well.
Improved Tool Use API Support
The OpenAI-compatible REST API now supports tool_choice, which lets you control how the model uses tools. The tool_choice parameter accepts three values, illustrated in the example request after this list:
"tool_choice": "none" tells the model not to call any tools.
"tool_choice": "auto" lets the model decide whether to call tools.
"tool_choice": "required" forces the model to output only tool calls (llama.cpp engines only).
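As a rough illustration, a request to the local OpenAI-compatible endpoint might look like the sketch below. The endpoint address, model name, and the get_weather tool are placeholders for this example, not part of LM Studio itself.

import requests

payload = {
    "model": "local-model",  # placeholder for whichever model is loaded
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool defined by your application
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",  # or "none" / "required"
}

response = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
print(response.json()["choices"][0])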
A bug in LM Studio's OpenAI-compatibility mode that prevented the chunk's "finish_reason" from being set to "tool_calls" has also been fixed.
Preview Community Presets
Presets combine system prompts with model parameters.
Since LM Studio 0.3.15, you can download and share user-made presets online, and you can also like and fork presets published by others.
This feature is enabled under Settings > General > "Enable publishing and downloading presets".
Once it is activated, right-clicking a preset in the sidebar reveals a "Publish" button you can use to share your preset with the community.
souhaillaghchimdev · 2 months ago
Text
Parallel Programming Fundamentals
As computing power advances, developers are looking for ways to make applications faster and more efficient. One powerful approach is parallel programming, which allows programs to perform multiple tasks simultaneously, significantly reducing execution time for complex operations.
What is Parallel Programming?
Parallel programming is a programming model that divides a task into smaller sub-tasks and executes them concurrently using multiple processors or cores. This differs from sequential programming, where tasks are executed one after the other.
Key Concepts
Concurrency: Multiple tasks make progress over time (may or may not run simultaneously).
Parallelism: Tasks are executed truly simultaneously on multiple cores or processors.
Threads and Processes: Units of execution that can run independently.
Synchronization: Ensuring data consistency when multiple threads access shared resources.
Race Conditions: Unintended behavior caused by unsynchronized access to shared data (both are illustrated in the sketch after this list).
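Here is a minimal Python sketch of the last two concepts: several threads increment a shared counter, first without synchronization (updates can be lost) and then under a lock.

import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    global counter
    for _ in range(n):
        counter += 1  # read-modify-write on shared data: not atomic, so updates can be lost

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:  # synchronization: only one thread updates the counter at a time
            counter += 1

def run(target):
    global counter
    counter = 0
    threads = [threading.Thread(target=target, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run(unsafe_increment))  # may print less than 400000 because of the race condition
print(run(safe_increment))    # always prints 400000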
Languages and Tools
Python: multiprocessing, threading, concurrent.futures
C/C++: POSIX threads (pthreads), OpenMP, CUDA for GPU parallelism
Java: Threads, ExecutorService, Fork/Join Framework
Go: Built-in goroutines and channels for lightweight concurrency
Simple Example in Python
import concurrent.futures
import time

def worker(n):
    time.sleep(1)
    return n * n

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(worker, range(5))
    for result in results:
        print(result)
Types of Parallelism
Data Parallelism: Splitting data into chunks and processing them in parallel (see the sketch after this list).
Task Parallelism: Different tasks running concurrently on separate threads.
Pipeline Parallelism: Tasks divided into stages processed in sequence but concurrently.
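For instance, data parallelism can be sketched in Python with a process pool: the data is split into chunks and each chunk is handled by a separate worker process.

from multiprocessing import Pool

def process_chunk(chunk):
    # stand-in for real per-chunk work
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool() as pool:  # one worker process per CPU core by default
        partial_sums = pool.map(process_chunk, chunks)  # chunks are processed in parallel
    print(sum(partial_sums))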
Benefits of Parallel Programming
Faster execution of large-scale computations
Better CPU utilization
Improved application performance and responsiveness
Challenges to Consider
Complex debugging and testing
Race conditions and deadlocks
Overhead of synchronization
Scalability limitations due to hardware or software constraints
Real-World Use Cases
Scientific simulations
Image and video processing
Machine learning model training
Financial data analysis
Gaming engines and real-time applications
Conclusion
Parallel programming is a game-changer for performance-critical software. While it introduces complexity, mastering its principles opens the door to high-speed, scalable applications. Start small with basic threading, then explore distributed and GPU computing to unlock its full potential.
aionlinemoney · 6 months ago
Text
Top AI Companies Shaping the Future of the World
Artificial intelligence is driving the world forward and helping us improve our daily lives. Across various industries, top AI companies are leading the way, changing how we work, communicate, and solve problems. These companies are creating advanced technologies that handle important tasks and bring innovative solutions to areas like healthcare, transportation, and more. Let’s explore some of the top AI companies that are driving and shaping the future.
What are Top AI Companies that are transforming the world:
OpenAI: A Leader in Generative AI
OpenAI, founded in 2015, is an important player in advanced artificial intelligence. Known for tools like GPT-4 and DALL·E, OpenAI has attracted worldwide attention for its innovative technology. Its large language models are used in chatbots, content creation, and programming.
Beyond its technical innovations, OpenAI plays a major role in promoting ethical AI use. With ChatGPT widely used by businesses and individuals, the company focuses on making AI tools accessible, helpful, and safe. By driving smarter decisions and creative solutions, OpenAI is not only transforming industries but also opening doors for new businesses, which is why it ranks among the top AI companies.
Google DeepMind: Advancing AI for a Better World
Google DeepMind focuses on using artificial intelligence to tackle real-world challenges. From mastering complex games like Go to solving major scientific problems, DeepMind's achievements are remarkable.
One of its most important projects is AlphaFold, which solved the mystery of protein folding. This breakthrough is transforming drug discovery and accelerating progress in healthcare. By using artificial intelligence to drive societal progress, DeepMind shows how advanced technology can be developed to benefit humanity, which earns it a place among the top AI companies driving the world forward.
NVIDIA AI: Driving Artificial intelligence with Advanced Hardware
NVIDIA plays a key role in advanced AI by providing the powerful hardware it needs. Famous for its GPUs (graphics processing units), NVIDIA supports artificial intelligence research and applications across industries.
Its CUDA platform helps researchers train complex neural networks quickly, while tools like NVIDIA Omniverse enable virtual simulations. NVIDIA is not just innovating for today; it is building the foundation for AI's future. From self-driving cars to gaming, its impact is vast, making it a crucial player in the AI revolution and one of the top AI companies.
Tesla: Leading the Way in Advanced Technology
Tesla is a pioneer in using AI for transportation. Under Elon Musk's leadership, the company has revolutionized electric vehicles by combining sustainable energy with advanced AI, placing it among the top AI companies in the world.
Tesla's Full Self-Driving (FSD) software displays its vision for autonomous travel. By leveraging neural networks and real-time data, Tesla vehicles can handle complex driving situations, paving the way for safer and more efficient transportation. While full autonomy is still a work in progress, Tesla's innovations have significantly pushed the boundaries of what's possible.
Microsoft: Advancing AI Through Collaboration
Microsoft has integrated artificial intelligence into its products to boost productivity and teamwork. By partnering with OpenAI, it has brought GPT technology to tools like Word and Excel, making everyday tasks simpler and more efficient.
Through Azure AI, its cloud-based platform, Microsoft helps developers create AI-powered applications across industries like healthcare and education. With a strong commitment to ethical AI practices, Microsoft continues to be a trusted leader, driving innovation while ensuring responsible use of technology.
Baidu: The AI Leader in China
Baidu, often called the “Google of China,” is a powerhouse in AI innovation. From autonomous driving to voice recognition, Baidu is leading AI development in Asia.
The company's Apollo project has made significant progress in self-driving technology, with multiple partnerships to deploy autonomous vehicles in the real world. Additionally, Baidu's AI-powered search engine and voice assistant serve millions of users, making it a critical player in the global AI landscape and one of the top AI companies driving the world through advanced technology.
Artificial Intelligence’s Impact and Responsibility
AI companies are reshaping industries, solving complex problems, and creating new opportunities, from healthcare to transportation. These companies' innovations are building a smarter, better-organized world.
However, with innovation comes the responsibility to ensure ethical and inclusive use. Whether it’s OpenAI’s generative tools or NVIDIA AI advanced hardware, these advancements highlight AI’s potential to benefit all of humanity.
Conclusion 
AI is transforming the world and these companies are leading the way to change the world. From OpenAI creative tools to Tesla’s self-driving cars, they are solving problems and creating new opportunities. Their work shows how AI can make life easier, safer, and more efficient. Read more AI-related News and Blogs only at AiOnlineMoney.
#aionlinemoney.com
ciotechviews · 8 months ago
Text
On Thursday, AMD officially released its newest AI chip, the Instinct MI325X. The company has positioned the product to take on Nvidia's dominant data center GPUs: the MI325X will compete against Nvidia's upcoming Blackwell chips, which begin shipping early next year. It is AMD's strategic move to seize a larger share of a booming AI chip market that Bloomberg places at $500 billion by 2028.
AMD has long been second in the data center GPU race, and with the MI325X it is looking to force Nvidia's hand. After all, Nvidia currently enjoys over 90% market share here. The launch could pressure the pricing of Nvidia's products, which have carried high margins amid soaring demand for its GPUs, driven largely by AI applications, chief among them OpenAI's ChatGPT.
CEO Lisa Su said that AI demand has been higher than expected and that investment across the industry is growing extremely quickly. AMD did not announce major new cloud customers, but it already partners with Meta and Microsoft and is supplying its AI chips for some OpenAI applications.
The biggest challenge AMD currently faces is Nvidia's proprietary CUDA programming language, which has become the de facto standard for AI developers. To counter that lock-in, AMD has been expanding the capabilities of its ROCm software to ease the move for developers already on the Nvidia side. AMD says the MI325X delivers up to 40% more performance than Nvidia's H200 when running Meta's Llama AI models, helped by its more advanced memory.
The company also debuted its 5th Gen EPYC CPUs, further solidifying its place within the data center. These announcements suggest that AMD is planning to go head-to-head with both Nvidia and Intel rather assertively in the AI and data center spaces.
enterprisewired · 8 months ago
Text
AMD Unveils AI Chip to Challenge Nvidia’s Dominance in Data Centers
Source – marktechpost.com
AMD’s AI Chip Launch Targets Nvidia’s Stronghold
On Thursday, AMD introduced a new artificial intelligence (AI) chip, the Instinct MI325X, marking a strategic move to compete with Nvidia’s dominant data center graphics processors (GPUs). The chip is expected to begin production by the end of 2024, AMD announced during the product launch event. This new release is seen as a direct challenge to Nvidia, which has maintained a commanding lead in the GPU market, especially as demand for AI technology continues to rise.
GPUs, like those produced by Nvidia, are essential for powering advanced generative AI models such as OpenAI’s ChatGPT. These models require large data centers filled with GPUs to handle massive amounts of processing. While Nvidia currently dominates this sector, AMD holds the second position and is determined to gain a larger share of the market, which it predicts will reach $500 billion by 2028.
During the event, AMD’s CEO, Lisa Su, expressed optimism about the growing demand for AI technology. “AI demand has actually continued to take off and exceed expectations. It’s clear that the rate of investment is growing everywhere,” Su said. While no new major cloud or internet customers were revealed, AMD has previously mentioned that companies like Meta, Microsoft, and OpenAI use its AI GPUs for specific applications.
Competition Heats Up Between AMD and Nvidia
The launch of the MI325X is a significant step in AMD’s efforts to close the gap with Nvidia, especially as both companies continue to push the boundaries of AI chip technology. The MI325X will compete directly with Nvidia’s forthcoming Blackwell chips, set to start shipping in early 2025. However, a key hurdle for AMD lies in Nvidia’s proprietary software programming language, CUDA, which has become the industry standard for AI developers. This language effectively locks developers into Nvidia’s ecosystem, making it difficult for them to transition to AMD products.
In response, AMD has been enhancing its own software, known as ROCm, which aims to simplify the process of shifting AI models to its chips. According to AMD, their AI accelerators offer a competitive edge in use cases where AI models are generating content or making predictions. One of the highlights of AMD’s new chip is its ability to outperform some Nvidia chips in serving Meta’s Llama AI model, delivering up to 40% more inference performance on specific tasks, Su noted.
AMD’s strategy to release new AI chips annually, starting with the MI325X, is part of its broader effort to capitalize on the AI boom and compete more aggressively with Nvidia. In the coming years, AMD plans to release the MI350 in 2025 and the MI400 in 2026.
Financial Impact and AMD’s Broader Plans
Despite the excitement surrounding the new chip, AMD’s stock dropped by 4% following the announcement, while Nvidia’s stock rose by 1%. AMD’s market presence in the AI chip sector remains significantly smaller than Nvidia’s, with Nvidia controlling over 90% of the data center AI chip market. However, investors may be drawn to AMD’s expanding role in AI, particularly if the MI325X launch proves successful.
In addition to AI chips, AMD remains a major player in central processing units (CPUs), which are at the core of nearly every server. The company reported that its data center sales more than doubled over the past year, reaching $2.8 billion in the June quarter. Of that, AI chips accounted for roughly $1 billion.
As part of its broader strategy to expand its footprint in data centers, AMD also introduced a new line of CPUs, called EPYC 5th Gen, which comes in various configurations. These CPUs range from low-cost, 8-core chips priced at $527 to 192-core processors intended for high-performance supercomputers, costing up to $14,813 per chip. According to AMD, these new CPUs are particularly well-suited for feeding data into AI workloads, a critical function for the growing AI landscape.
AMD’s efforts to compete with Nvidia, especially in AI technology, may face challenges, but with new innovations and product releases, the company is positioning itself to play a larger role in the future of AI and data centers.
avandelay20 · 10 months ago
Text
Summarized by Bing Chat:
Eric Schmidt’s talk on “The Age of AI” at Stanford ECON295/CS323.
Introduction
Eric Schmidt, former CEO of Google and founder of Schmidt Futures, begins his talk by discussing the rapid advancements in artificial intelligence (AI) and its profound implications for the future. He emphasizes the importance of staying updated on AI developments due to the fast-paced nature of the field. Schmidt’s extensive experience in the tech industry provides a unique perspective on the transformative potential of AI.
Short-Term AI Developments
In the short term, Schmidt highlights the concept of a “million-token context window.” This refers to the ability of AI models to process and understand vast amounts of information simultaneously. This advancement is expected to significantly enhance AI capabilities within the next one to two years. Schmidt explains that this development will enable AI systems to handle more complex tasks and provide more accurate and contextually relevant responses.
AI Agents and Text-to-Action
Schmidt delves into the technical definitions of AI agents and the concept of text-to-action. AI agents are specialized programs designed to perform specific tasks autonomously. Text-to-action involves converting text inputs into actionable commands, such as programming in Python. Schmidt illustrates this concept with examples, demonstrating how AI can streamline various processes and improve efficiency in different domains.
The Dominance of Python and New Programming Languages
Python has long been the dominant programming language in the AI community due to its simplicity and versatility. Schmidt introduces a new language called Mojo, which aims to address some of the challenges associated with AI programming. While he acknowledges the potential of Mojo, Schmidt expresses skepticism about whether it will surpass Python’s dominance. He emphasizes the importance of continuous innovation in programming languages to keep pace with AI advancements.
Economic Implications of AI
The economic impact of AI is a significant focus of Schmidt’s talk. He discusses the reasons behind NVIDIA’s success in the AI market, attributing the company’s $2 trillion valuation to its CUDA optimizations. These optimizations are crucial for running AI code efficiently, making NVIDIA a key player in the AI hardware industry. Schmidt also explores the broader economic implications of AI, including its potential to disrupt traditional industries and create new opportunities for growth.
AI in Business and Society
Schmidt concludes his talk by discussing the broader implications of AI for businesses and society. He emphasizes the need for organizations and individuals to adapt to the rapidly changing AI landscape. Schmidt highlights the importance of ethical considerations in AI development and deployment, stressing the need for responsible AI practices to ensure positive outcomes for society.
Conclusion
In summary, Eric Schmidt’s talk on “The Age of AI” provides valuable insights into the current state and future potential of artificial intelligence. He covers a wide range of topics, from technical advancements and programming languages to economic implications and ethical considerations. Schmidt’s expertise and experience offer a comprehensive overview of the transformative power of AI and its impact on various aspects of our lives.
yescs2020 · 1 year ago
Text
talking the python language
this is so typical geek type conversation
and with the very popular python programming language causing an issue
User 1 has an issue and cant figure it out so he asks for help on reddit
here is the only reply he gets:
"you are trying to run before knowing how to walk. This is probably not what you want to hear but it seems you are lacking basic knowledge of python programming. It would make sense for you to focus on mastering the basics first. The error pretty clearly tells you that a package is missing. Look into package managers live venv or anaconda/miniconda."
User 1 then asks: 3 days ago Thanks for replying. Can you please tell what the exact "package" is missing? If that is msd_pytorch, then that is already there in the environment. Possible explanation could be not matching the version of cuda that is required by msd_pytorch.
So, which exact package is missing. And if you get that, please tell where to install that. Cause I find nothing as mentioned in this "error" in pip or conda or the whole internet…
here is the follow up now to the first reply he got:
3 days ago Yes, the cuda version mismatch seems to be the problem. Have you tried reinstalling pytorch in a conda environment following the instructions from the pytorch page and then installing this msd_pytorch with pip in the "clean" conda environment?
hmmm so did u follow this reply at all????? or did u get left too without a pip in a "clean" conda envrionment?????
maybe that is why the "whole internet" has nothing to say except u have reached the end of the internet
yeah my blog my rules my esoteric posts that keeps tumblr fun
jetsonhacks · 1 year ago
Text
CUDA Programming in Python with Numba
Faster Python and write CUDA code in Python? Yes Please!
CUDA programming in Python is a good way to start learning to leverage the power of GPU hardware. At the same time, wouldn’t it be nice to be able to speed up those bottleneck Python functions? Looky here:
Background
Python has a reputation for being ‘slow’. In part this is because it is an interpreted language. That means that there is software emulating a machine on hardware. The alternative,…
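As a taste of what this looks like, here is a small vector-add kernel written with Numba's CUDA support. It is a sketch that assumes a CUDA-capable GPU and the numba and numpy packages are installed.

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)      # global thread index
    if i < x.size:        # guard threads that fall outside the array
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.zeros_like(x)

threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks_per_grid, threads_per_block](x, y, out)  # Numba copies the arrays to and from the GPU
print(out[:5])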
myprogrammingsolver · 1 year ago
Text
Computer Architecture Homework 4 Solution
Description
The goal of this homework is to practice GPU programming using the CUDA language and to experience first-hand improving the performance of GEMM (General Matrix Multiplication), which is a staple in many important applications such as machine learning.
Building and Running
Environment Setup
The homework is set up to run with the nvcc compiler (NVIDIA CUDA compiler) and a Make build system.…
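The assignment itself targets CUDA C with nvcc, but the basic idea of a naive GEMM kernel (one thread per output element, before any tiling or shared-memory optimisation) can be sketched in Python with Numba, assuming a CUDA-capable GPU:

import numpy as np
from numba import cuda

@cuda.jit
def naive_gemm(A, B, C):
    i, j = cuda.grid(2)                  # one thread computes one element C[i, j]
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[k, j]
        C[i, j] = acc

M = N = K = 512
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

threads = (16, 16)
blocks = ((M + threads[0] - 1) // threads[0], (N + threads[1] - 1) // threads[1])
naive_gemm[blocks, threads](A, B, C)
print(np.allclose(C, A @ B, rtol=1e-3))  # sanity check against NumPy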
govindhtech · 8 months ago
Text
SYCL 2020’s Five New Features For Modern C++ Programmers
SYCL
For C++ programmers who use accelerators, SYCL 2020 is an exciting release. Intel engineers have enjoyed contributing to the SYCL standard, to a book about it, and to the DPC++ open source effort to integrate SYCL into LLVM, and the SYCL 2020 standard includes some of their favorite new features. The views here are those of Intel engineers, not of Khronos.
Khronos allows heterogeneous C++ programming with SYCL. After SYCL 2020 was finalized in late 2020, compiler support increased.
The case for SYCL has been made in several places, including "Considering a Heterogeneous Future for C++" and other materials on sycl.tech. How can heterogeneous C++ programming be enabled with portability across vendors and architectures? SYCL answers that question.
Thanks to community involvement, SYCL 2020 offers interesting new capabilities that keep it firmly multivendor and multiarchitecture.
The Best Five
A fundamental purpose of SYCL 2020 is to harmonize with ISO C++, which offers two advantages. First, it makes SYCL natural for C++ programmers. Second, it lets SYCL test multivendor, multiarchitecture heterogeneous programming solutions that may influence other C++ libraries and ISO C++.
Changing the base language from C++11 to C++17 allows developers to use class template argument deduction (CTAD) and deduction guides, which necessitated several syntactic changes in SYCL 2020.
Backends allow SYCL to target more hardware by supporting languages/frameworks other than OpenCL.
USM is a pointer-based alternative to SYCL 1.2.1’s buffer/accessor concept.
A “built-in” library in SYCL 2020 accelerates reductions, a frequent programming style.
The group library abstracts cooperative work items, improving application speed and programmer efficiency by aligning with hardware capabilities (independent of vendor).
Atomic references aligned with C++20 std::atomic_ref expand heterogeneous device memory models.
These enhancements make the SYCL ecosystem open, multivendor, and multiarchitecture, allowing C++ writers to fully leverage heterogeneous computing today and in the future.
Backends
With backends, SYCL 2020 allows implementations built on languages and frameworks other than OpenCL. As part of this change, the namespace has been simplified to sycl:: and the SYCL header file has been relocated from <CL/sycl.hpp> to <sycl/sycl.hpp>.
These modifications affect SYCL deeply. Although implementations are still free to build atop OpenCL (and many do), generic backends have made SYCL a programming approach that can target more diverse APIs and hardware. SYCL can now “glue” C++ programs to vendor-specific libraries, enabling developers to target several platforms without changing their code.
This flexibility has allowed the open-source DPC++ compiler effort to support NVIDIA, AMD, and Intel GPUs by implementing SYCL 2020 in LLVM (clang), giving SYCL 2020 true openness across architectures and vendors.
Unified shared memory
Some devices provide a unified view of memory shared between the host CPU and the device. Unified shared memory (USM) in SYCL 2020 enables a pointer-based access model as an alternative to the buffer/accessor model of SYCL 1.2.1.
Programming with USM provides two benefits. First, USM provides a single address space across host and device; pointers to USM allocations are consistent and may be provided to kernels as arguments. Porting pointer-based C++ and CUDA programs to SYCL is substantially simplified. Second, USM allows shared allocations to migrate seamlessly between devices, enhancing programmer efficiency and compatibility with C++ containers (e.g., std::vector) and algorithms.
Three USM allocations provide programmers as much or as little data movement control as they want. Device allocations allow programmers full control over application data migration. Host allocations are beneficial when data is seldom utilized and transporting it is not worth the expense or when data exceeds device capacity. Shared allocations are a good compromise that immediately migrate to use, improving performance and efficiency.
Reductions
Other C++ reduction solutions, such as P0075 and the Kokkos and RAJA libraries, influenced SYCL 2020.
The reducer class and the reduction function simplify the expression of variables with reduction semantics in SYCL kernels. They also let implementations specialize the reduction algorithm at compile time for good performance on devices from different manufacturers.
The famous BabelStream benchmark, published by the University of Bristol, shows how SYCL 2020 reductions increase performance. BabelStream’s basic dot product kernel computes a floating-point total of all kernel work items. The 43-line SYCL 1.2.1 version employs a tree reduction in work-group local memory and asks the user to choose the optimal device work-group size. SYCL 2020 is shorter (20 lines) and more performance portable by leaving algorithm and work-group size to implementation.
Group Library
The work-group abstraction from SYCL 1.2.1 is expanded by a sub-group abstraction and a library of group-based algorithms in SYCL 2020.
A sub_group describes a set of work-items in a kernel that execute cooperatively "together," providing a portable abstraction across hardware vendors. In the DPC++ compiler, sub-groups always map to a key hardware concept: SIMD vectorization on Intel architectures, "warps" on NVIDIA architectures, and "wavefronts" on AMD architectures, enabling low-level performance optimization of SYCL applications.
In another tight agreement with ISO C++, SYCL 2020 includes group-based algorithms based on C++17: all_of, any_of, none_of, reduce, exclusive_scan, and inclusive_scan. SYCL implementations may use work-group and/or sub-group parallelism to produce finely tailored, cooperative versions of these functions since each algorithm is supported at various scopes.
Atomic references
C++20 improved atomics with the ability to wrap existing objects in atomic references (std::atomic_ref). SYCL 2020 extends this design (as sycl::atomic_ref) with address spaces and memory scopes, creating an atomic reference implementation ready for heterogeneous computing.
SYCL follows ISO C++, but memory scopes were necessary for portable programming that does not sacrifice speed; the complicated memory topologies of heterogeneous systems cannot be ignored.
Memory models and atomics are complicated, so to support as many devices as feasible SYCL does not require every device to implement the entire C++ memory model. Instead, SYCL accommodates a wide range of device capabilities, another example of being accessible to all vendors.
Beyond SYCL 2020: Vendor Extensions
SYCL 2020’s capability for multiple backends and hardware has spurred vendor extensions. These extensions allow innovation that provides practical solutions for devices that require it and guides future SYCL standards. Extensions are crucial to standardization, and the DPC++ compiler project’s extensions inspired various elements in this article.
Two new DPC++ compiler features are SYCL 2020 vendor extensions.
Group-local Memory at Kernel Scope
Local accessors in SYCL 1.2.1 provide group-local memory, but it must be declared outside the kernel and passed in as a kernel argument. This can seem odd to programmers coming from OpenCL or CUDA, so an extension was created that allows group-local memory to be declared directly inside a kernel function. This makes kernels more self-contained and informs compiler optimizations (since the local memory usage is known at compile time).
FPGA-Specific Extensions
The DPC++ compiler project supports Intel FPGAs, and it appears that these modifications, or something similar, could work with FPGAs from any vendor. FPGAs are a significant accelerator segment, and we believe this pioneering work will shape future SYCL standards, along with other vendor extension initiatives.
FPGA selectors have been introduced to make it easier to obtain FPGA hardware or emulation devices; the latter allows quick prototyping, which FPGA software developers must consider. FPGA LSU controls allow load/store operations to be tuned and a specific global memory access configuration to be requested. Data placement controls for external memory banks (e.g., a DDR channel) have also been implemented so FPGA designs can be tuned via FPGA memory channels, and FPGA registers provide major tuning controls for high-performance FPGA pipelining.
Summary
Heterogeneity endures. Many new hardware alternatives focus on performance and performance-per-watt. This trend will need open, multivendor, multiarchitecture programming paradigms like SYCL.
The five new SYCL 2020 features described here help deliver portability and performance portability. With SYCL 2020, C++ programmers can take full advantage of heterogeneous computing.
Read more on Govindhtech.com
nostalgebraist · 1 year ago
Text
I never can tell how many of the people who complain about how slow python is are coming at it disingenuously ignoring that numpy exists.
Yeah, like... as someone said on one of the threads that inspired the OP, many of the popular use cases for python don't involve doing everything in python. They use python as a high-level interface to libraries of efficient routines written in something like C++.
I guess someone could look at this and ask, "why not do the whole thing in C++, then?" And the answer is, there is no reason whatsoever to do the whole thing in C++. You wouldn't get any benefit from doing so, not even a performance benefit.
Instead, you have the best of both worlds. You get the ease and readability of python when you're constructing and issuing commands to the low-level routines; you get the performance of the low-level language when these commands are executed. It's a great division of labor.
Does this incur overheads from the slowness of python? Well, yeah, sometimes. But it's tiny; it's almost never the biggest drag on your program's performance, and if it is that's usually a red flag that you're doing something else wrong.
----
For example, a very popular way to do machine learning on GPUs involves writing python, which calls C++, which in turn calls even lower-level CUDA code that runs on the GPU.
The time cost of running a CUDA kernel is a sum of two terms:
the actual time spent inside the kernel
the time spent copying the inputs into the fast kind of memory that kernels use, and likewise copying the output back into the other kind of memory
The slowness of the memory copying step is a really big deal, and you want to run that step as few times as possible. This overhead -- which is a fact of hardware, independent of the host language used -- is a way bigger deal in practice than python overhead.
This overhead is greatest when your program calls a long list of short-lived kernels in sequence, each of which does relatively little work. (The memory copies are equally expensive if your kernel does a little work or a lot; they depend only on the size of the data, not how much processing was done to it.)
So you want to run as much of the program as possible in a single kernel, before handing control back to the host.
Achieving this often looks like taking two existing kernels (K1, K2) that you want to run -- one right after the other -- and "fusing" them into a single kernel.
Where the fused kernel effectively just runs K1 and then runs K2 on the output, all on the GPU in fast memory -- without the needless step of handing control and data back to the host, which will immediately hand them right back to the GPU again.
So, if you're using the kind of python/C++/CUDA library mentioned above, you will naturally end up trying to minimize the number of
(python -> C++ -> CUDA -> C++ -> python)
round trips you're doing. Not because of any property of python, or even of C++, but because any time you have something like
(not CUDA -> CUDA -> not CUDA)
you want to do it as rarely as possible. Every time you end up in "CUDA" you want to do as much as you can there before leaving, that's just how the game works. Every time you're in "CUDA," it's a privilege, paid for by an expensive pair of "-> CUDA" and "CUDA ->" arrows. You want to make it worth the cost of admission -- to see all the rides, play all the games, stay all day.
----
This is one reason that python overheads are not a big deal in ML: the same optimizations you would be doing anyway happen to minimize python time as an automatic side effect.
But in fact, the situation is even nicer than that! Even if you are not going out of your way to fuse kernels and stuff, overheads are not usually a big deal in the first place. Consider:
First, most of the computational heavy lifting in ML consists of multiplying large matrices together. This is a single-kernel operation already, no fusion required.
And it's an expensive operation, both in terms of runtime (big matrices are slow to multiply) and in terms of memory copies (you have to copy the big matrices).
In practice, if you just do the default thing without any fancy optimizations, you will do one "round trip" per big-matrix-multiplication. That is, your python code is saying "hey, multiply these two big matrices" over and over.
Even if you incur a bit of overhead each time "python says to do something," actually doing the thing takes a lot of time. So the overhead is a much smaller percentage of runtime than you might expect.
Second, it is not actually true that you incur a bit of overhead each time python says to do something!
CUDA kernel launches are asynchronous. When "python says to multiply two matrices," what really happens is that a kernel gets queued up, to be run when all of its inputs are ready and no previously-queued kernel is still running.
In practice, if you're doing things right, python actually runs out ahead of CUDA, issuing instructions that effectively mean "hey, compute f(x) once you've got x ready," and then "compute g(f(x)) once you've got f(x) ready," and so on.
As long as the python control flow doesn't depend on the values of things like "x" and "f(x)", you can write python code that looks like it's doing operations on data in the usual way, but is really acting on placeholders like "x, once it's ready."
Thus, the python overhead is often literally zero during the main heavy-lifting segment of the computation. Python is putting new instructions on the queue while the old ones are still executing; the queue is never empty, the GPU is never waiting on the host.
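A toy way to see this queueing behavior for yourself, assuming a CUDA build of pytorch and a GPU to run it on:

import time
import torch

a = torch.randn(4096, 4096, device="cuda")

start = time.time()
for _ in range(100):
    a = (a @ a) / 4096.0       # queue up 100 big matmuls
queued = time.time() - start    # returns almost immediately: launches are asynchronous
torch.cuda.synchronize()        # now actually wait for the GPU to finish
finished = time.time() - start
print(f"queued in {queued:.4f}s, finished in {finished:.4f}s")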
With all that said, is python overhead even a problem at all? Well, yes, actually, especially if you're not careful about it. And people who are really trying to squeeze out every possible performance gain will typically try to remove python from the pipeline, at least eventually.
But the speedup from doing this tends to be, like, 10%. (If you get even a 2x speedup from removing python here, that's amazing but also it means you were using the python library in some horribly suboptimal way and you should be ashamed of yourself.) It's not the order-of-magnitude speedup usually associated with moving away from an interpreted language.
For more on all this stuff, see Horace He's instant-classic blog post Making Deep Learning Go Brrrr From First Principles. (I'm really just summarizing part of He's post, above, and his is better.)
----
There used to be a library sort of like what I described above -- but also very different, in some ways -- called "tensorflow." There still is, actually, I just forget it exists most of the time.
In tensorflow, you used a high-level language to write a static program in a DSL unique to tensorflow, a "graph" of tensorflow "ops." Then you'd run this program. Under the hood, each "op" would be a CUDA kernel or something, roughly.
This was really annoying, because you couldn't use any of the host language features in the program that ran on the accelerator. You could use python control flow to decide what tensorflow program to write, but the control flow in that program had to be written in tensorflow-ese.
So for example, there was an "op" called while_loop, where the loop condition had to be another "op." You could do a limited range of stuff that looked kind of dynamic, if it was supported in the form of something like while_loop, but mostly everything had to be known before runtime.
You would think all of these before-runtime guarantees would enable optimization and thus performance gains. Or at least I assumed so -- otherwise, what's the point? And I think in certain cases, like on TPUs rather than GPUs, this was actually true.
But then this thing arrived called "pytorch." Initially it sounded absurd to me. In pytorch, you just imperatively run python code, and it tells C++ to tell CUDA to do something. And CUDA runs a kernel and hands the output back up to python. You do this entire handoff every time you run a kernel, basically.
Surely this was too slow for real use, I thought, ignorantly. "Python is slow," after all. And surely I was getting something in exchange for the painful work of making those static tensorflow graphs, right? ...right?
Then I tried it and -- on the very first thing I tried -- it was faster than tensorflow. On the same exact model.
Because none of the stuff that I thought mattered, really mattered. I was reasoning on the basis of generalized bromides, like "interpreted languages are slow" and "static compile-time information enables optimization," rather than thinking about what my computer was really doing.
Both pytorch and tensorflow do automatic differentiation: whenever you do one of these calculations on the GPU or whatever, you have the option to also calculative derivatives of that calculation's outputs with respect to all its inputs.
In tensorflow, there's a static graph, and tensorflow just applies the chain rule to this graph.
pytorch does the same thing. But on what? You're just imperatively running one thing after another. You might even be doing it interactively.
Well, pytorch creates an ephemeral graph from the operations that python tells it to do, on the fly. When you ask for derivatives, they are computed using the graph, and then the graph is destroyed. If you do the same operations again, the same graph is constructed again from scratch. Typically you do this a huge number of times.
If you run some python control flow, and then ask for the derivative, pytorch gives you the only thing it can, which is probably what you wanted: the derivatives of the calculation that actually happened, as a result of the control flow that actually happened. If you run a while loop, and it ends up running 61 iterations, then you will get the derivative of "doing the loop body 61 times." If there's control flow in the loop body, making iteration 5 different from iteration 6, the derivative just follows whatever really happened. You can do literally anything that is possible in python, and pytorch will make the graph corresponding to the result, and then take the derivative of it.
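A toy version of that, assuming pytorch is installed: the loop below runs a data-dependent number of times, and the gradient you get back is the gradient of whatever actually ran.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x
steps = 0
while y < 100:      # ordinary python control flow, driven by the data
    y = y * x
    steps += 1

y.backward()
# the graph that got built corresponds to y = x ** (steps + 1),
# so x.grad is (steps + 1) * x ** steps -- the derivative of what actually happened
print(steps, y.item(), x.grad.item())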
You might think this would be wasteful and expensive. But in fact, it's fine. You can construct the exact same graph and then destroy it, hundreds of thousands of times, and you won't even notice.
A lot of the things you might think are wasteful and expensive are, in fact, fine. Conversely, the real sources of overhead are things you would never expect.
(Did you know that, except in the most recent python versions, the function "dataclasses.asdict" is very slow? I didn't, until a coworker profiled some of my code two weeks ago.)
You have to think about what your computer is really doing. There's no substitute.
----
I wrote all this to --
-- well, honestly, I mostly wrote it to amuse myself.
But also to illustrate the following point: while it is true, at a sort of CS 101 level of simplification and abstraction away from real problems, that "python is slow because it's an interpreted language (etc.)", this is just not relevant at all to the kind of high-performance computing people are doing (partially) in python.
Like, the people writing these libraries are not stupid. They've taken CS 101 just like you. Also, they are writing libraries for the most compute-intensive type of calculation that anyone does anywhere. The highest summit of "high-performance computing."
If using python was slowing down their computations, they would do something else. But it's not. They know this, even if you don't.
(I'm largely making up a guy to get mad at, I think. I think most people who say stuff like "oh, python is bad because it's slow" are implicitly thinking of other applications where this argument would be more reasonable, or just making the limited CS 101 point for its own sake. And most objections to python are about other things.
But even there, the argument generalizes, perhaps. The people who write the foundational components of large-scale machine learning systems are really smart. They know all about low-level programming, about hardware, about distributed systems, algorithms, Linux esoterica, the whole nine yards. They are, often, the sort of programmers where the term "full-stack engineer" would be an insulting understatement of their versatility.
If they use a lot of python, well, that is partly a matter of taste, and of path-dependent cultural trends. But it is also a matter of picking the right tool for the job.)
Seeing a lot of python hate on the dash today... fight me guys. I love python. I am a smoothbrained python enjoyer and I will not apologize for it
Python has multiple noteworthy virtues, but the most important one is that you can accomplish stuff extremely fast in it if you know what you are doing.
This property is invaluable when you're doing anything that resembles science, because
Most of the things you do are just not gonna work out, and you don't want to waste any time "designing" them "correctly." You can always go back later and give that kind of treatment to the rare idea that actually deserves it.
Many of your problems will be downstream from the limitations in how well you can "see" things (high-dimensional datasets, etc.) that humans aren't naturally equipped to engage with. You will be asking lots and lots of weirdly shaped, one-off questions, all the time, and the faster they get answered the better. Ideally you should be able to get into a flow state where you barely remember that you're technically "coding" on a "computer" -- you feel like you're just looking at something, from an angle of your choice, and then another.
You will not completely understand the domain/problem you're working on, at the outset. Any model you express of it, in code, will be a snapshot of a bad, incomplete mental model you'll eventually grow to hate, unless you're able to (cheaply) discard it and move on. These things should be fast to write, fast to modify, and not overburdened by doctrinaire formal baggage or a scale-insensitive need to chase down tiny performance gains. You can afford to wait 5 seconds occasionally if it'll save you hours or days every time your mental map of reality shifts.
The flipside of this is that it is also extremely (and infamously) easy to be a bad python programmer.
In python doing the obvious thing usually just works, which means you can get away with not knowing why it works and usually make it through OK. Yes, this is cringe or whatever, fine. But by the same token, if you do know what the right thing to do is, that thing is probably very concise and pretty-looking and transparent, because someone explicitly thought to design things that way. What helps (or enables) script kiddies can also be valuable to power users; it's not like there's some fundamental reason the interests of these two groups cannot ever align.
paradisetechsoftsolutions · 5 years ago
Link
In this tutorial, you will learn about CUDA installation guidelines and GPU computing: configuring GPU TensorFlow on Ubuntu and a guide to installing the CUDA 9.0 Toolkit on Ubuntu.